Evaluating Inter-Bilingual Semantic Parsing for Indian Languages
Despite significant progress in Natural Language Generation for Indian languages (IndicNLP), there is a lack of datasets around complex structured tasks such as semantic parsing. One reason for this gap is the complexity of the logical form, which makes English-to-multilingual translation difficult: the logical forms, intents, and slots must be aligned with the translated unstructured utterance. To address this, we propose IE-SEMPARSE, an inter-bilingual seq2seq semantic parsing dataset for 11 distinct Indian languages. We highlight the proposed task's practicality and evaluate existing multilingual seq2seq models across several train-test strategies. Our experiments reveal a high correlation between model performance on existing multilingual semantic parsing datasets (such as mTOP, Multilingual TOP and multiATIS++) and on our proposed IE-SEMPARSE suite.
Comment: 21 pages, 9 figures, 15 tables
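
To make the alignment step concrete, the sketch below shows one way slot values in a TOP-style English logical form could be projected onto a translated utterance using a phrase-alignment map. The logical form, slot names, and the Hindi translation are illustrative placeholders, not examples drawn from IE-SEMPARSE.

# Hypothetical sketch: projecting slot values from an English TOP-style logical
# form onto a translated utterance. The intent/slot labels and the
# phrase-alignment map below are illustrative, not taken from IE-SEMPARSE.
import re

def project_slots(logical_form: str, phrase_map: dict[str, str]) -> str:
    """Replace each slot value with its aligned translation, keeping intents and slots."""
    def replace(match: re.Match) -> str:
        slot, value = match.group(1), match.group(2).strip()
        translated = phrase_map.get(value, value)  # fall back to the source phrase
        return f"[SL:{slot} {translated} ]"
    return re.sub(r"\[SL:(\w+) ([^\[\]]+?)\]", replace, logical_form)

english_lf = "[IN:CREATE_ALARM [SL:DATE_TIME tomorrow at 7 am ] ]"
# Alignment map as produced by e.g. an automatic phrase aligner (assumed input).
alignment = {"tomorrow at 7 am": "कल सुबह 7 बजे"}
print(project_slots(english_lf, alignment))
# [IN:CREATE_ALARM [SL:DATE_TIME कल सुबह 7 बजे ] ]
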
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages
We create publicly available language identification (LID) datasets and
models in all 22 Indian languages listed in the Indian constitution in both
native-script and romanized text. First, we create Bhasha-Abhijnaanam, a
language identification test set for native-script as well as romanized text
which spans all 22 Indic languages. We also train IndicLID, a language
identifier for all the above-mentioned languages in both native and romanized
script. For native-script text, it has better language coverage than existing LIDs and performs competitively with or better than them. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized-text LID are the lack of training data and low LID performance when languages are closely related. We provide simple and effective solutions to these problems. In
general, there has been limited work on romanized text in any language, and our
findings are relevant to other languages that need romanized language
identification. Our models are publicly available at
https://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training
and test sets are also publicly available at
https://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses.
Comment: Accepted to ACL 2023
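
As a rough illustration of how such an LID model might be used, the snippet below loads a fastText-format classifier and predicts a language label for native-script and romanized inputs. The model file name, label scheme, and example sentences are assumptions; the released IndicLID models and their actual interface are documented at the links above.

# Minimal LID sketch using the fastText library. The model file name and
# label format are hypothetical; consult the IndicLID release at
# https://ai4bharat.iitm.ac.in/indiclid for the actual model files and API.
import fasttext

# Load a (hypothetical) fastText-format Indic LID model.
model = fasttext.load_model("indic_lid.bin")

samples = [
    "मैं कल दिल्ली जा रहा हूँ",      # Hindi, native (Devanagari) script
    "naan innaikku chennai poren",   # Tamil, romanized
]
for text in samples:
    labels, probs = model.predict(text, k=1)  # top-1 prediction
    print(f"{text!r} -> {labels[0]} ({probs[0]:.2f})")
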
CTQScorer: Combining Multiple Features for In-context Example Selection for Machine Translation
Large language models have demonstrated the capability to perform machine translation when prompted with a few examples (in-context learning). Translation quality depends on various features of the selected examples, such as their quality and relevance, but previous work has predominantly focused on individual features in isolation. In this paper, we
propose a general framework for combining different features influencing
example selection. We learn a regression model, CTQ Scorer (Contextual
Translation Quality), that selects examples based on multiple features to maximize translation quality. On multiple language pairs and language models, we show that CTQ Scorer significantly outperforms random selection as well as strong single-factor baselines reported in the literature. We also observe an improvement of over 2.5 COMET points on average over a strong BM25 retrieval-based baseline.
Comment: Accepted to Findings of EMNLP 2023
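
The sketch below illustrates the general idea of regression-based example selection: compute several features for each candidate in-context example, predict a contextual translation quality score with a learned regressor, and keep the top-scoring examples. The specific features, the Ridge regressor, and the toy training targets are illustrative assumptions rather than the paper's exact setup.

# Illustrative sketch of feature-based in-context example selection in the
# spirit of CTQ Scorer. The features, the Ridge regressor, and the training
# targets below are assumptions for illustration, not the paper's exact setup.
import numpy as np
from sklearn.linear_model import Ridge

def featurize(query: str, example_src: str, example_tgt: str) -> list[float]:
    """Toy features for a (query, candidate example) pair."""
    q_tokens, s_tokens = set(query.split()), set(example_src.split())
    overlap = len(q_tokens & s_tokens) / max(len(q_tokens), 1)                # lexical overlap (retrieval-score stand-in)
    len_ratio = len(example_tgt.split()) / max(len(example_src.split()), 1)   # target/source length ratio
    src_len = float(len(example_src.split()))                                 # example length
    return [overlap, len_ratio, src_len]

# Train the scorer on (features, downstream translation quality) pairs, e.g.
# quality scores obtained by prompting an LLM with each candidate (hypothetical data).
X_train = np.array([[0.6, 1.1, 12.0], [0.1, 0.9, 30.0], [0.4, 1.0, 8.0]])
y_train = np.array([0.82, 0.55, 0.74])  # hypothetical quality scores
scorer = Ridge(alpha=1.0).fit(X_train, y_train)

# At test time: score every candidate example for a new query and keep the top-k.
query = "the committee approved the annual budget"
pool = [("the board approved the budget", "बोर्ड ने बजट को मंजूरी दी"),
        ("he plays football every sunday", "वह हर रविवार फुटबॉल खेलता है")]
X_pool = np.array([featurize(query, s, t) for s, t in pool])
top_k = np.argsort(scorer.predict(X_pool))[::-1][:1]
print([pool[i] for i in top_k])
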